1 Missing data study

1.0.1 Environment Setup

All utility functions used throughout this analysis are defined in setup.R.

# Load the here package for project-relative paths
suppressMessages(library(here))

# Load utilities
source(here("src", "setup.R"))


1.1 Missing value patterns

In statistical analysis, understanding the patterns of missing data is crucial for selecting appropriate handling methods. The primary patterns are:

Univariate Missing Data: Occurs when only a single variable has missing values. This pattern is uncommon across most disciplines and typically arises in experimental studies.

Monotone Missing Data: Characterized by a situation where the missingness of one variable implies the missingness of all subsequent variables. This pattern is often observed in longitudinal studies where participants drop out and do not return. The monotone pattern is generally easier to manage, as the missingness follows a clear, observable sequence.

Non-Monotone Missing Data: Occurs when the missingness of one variable does not predict the missingness of other variables. This pattern is more complex and requires careful analysis to handle appropriately.

[Figure: the three missing-data patterns (univariate, monotone, non-monotone) shown over variables x1-x4, with cells marked Missing or Present.]
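The monotone/non-monotone distinction can be checked mechanically. A minimal base-R sketch (the function name is ours, not from any package):

```r
# Returns TRUE when the missingness pattern is monotone: in every row,
# once a value is missing, all subsequent columns are missing as well.
is_monotone_pattern <- function(df) {
  miss <- is.na(df)
  all(apply(miss, 1, function(r) {
    first_na <- match(TRUE, r)  # index of first missing entry, NA if none
    is.na(first_na) || all(r[first_na:length(r)])
  }))
}

# Dropout-style data (monotone) vs. scattered gaps (non-monotone)
monotone     <- data.frame(x1 = c(1, 2), x2 = c(3, NA), x3 = c(NA, NA))
non_monotone <- data.frame(x1 = c(1, 2), x2 = c(NA, 3), x3 = c(4, NA))
is_monotone_pattern(monotone)      # TRUE
is_monotone_pattern(non_monotone)  # FALSE
```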

1.2 Missing value generation mechanisms

Another way to classify missing values is based on their underlying mechanism.

Rubin’s missing data theory introduces a dataset \(Y\) that is partitioned into observed values (\(Y_o\)) and missing values (\(Y_m\)). The presence or absence of data is tracked by an indicator matrix \(R\) defined for each element of \(Y\) as:

\[ R = \begin{cases} 0 & \text{if } Y \text{ is observed} \\ 1 & \text{if } Y \text{ is missing} \end{cases} \]

Missing Completely at Random (MCAR) defines a mechanism where the probability of missingness \(P(R|q)\) has no relationship with any values in the dataset. Implementation can be univariate, where missing values are inserted through random selection or Bernoulli trials with probability \(p\), or multivariate, where missing values are distributed either uniformly or randomly across the entire dataset.

[Figure: data over x1-x4 with randomly missing values (MCAR).]
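The univariate Bernoulli variant of MCAR can be sketched in two lines of base R (an illustrative helper with our own name, not a package function):

```r
# Minimal univariate MCAR sketch: each entry of `col` is deleted via an
# independent Bernoulli trial with probability p, with no reference to
# any value in the dataset.
delete_mcar_uni <- function(df, col, p) {
  df[[col]][runif(nrow(df)) < p] <- NA
  df
}

set.seed(42)
d <- delete_mcar_uni(data.frame(x = rnorm(1000), y = rnorm(1000)), "x", p = 0.3)
mean(is.na(d$x))  # close to 0.3
```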

Missing At Random (MAR/CAR) occurs when missingness probability \(P(R|Y_o, q)\) depends on observed data patterns but not on missing values. For univariate cases, missing values are determined by the patterns of observed data, such as rank-based probabilities or percentile thresholds. In multivariate cases, missingness arises due to relationships or correlations between feature pairs.

[Figure: MAR, where the variable with missing values among x1-x4 can be explained by the observed data.]
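A rank-based univariate MAR mechanism, as mentioned above, can be sketched as follows (our own illustrative function, not a package implementation): the probability that a value is deleted grows with the rank of a fully observed control variable.

```r
# MAR sketch: the deletion probability for col_mis is proportional to the
# rank of the observed control column, so the expected overall missing
# proportion is p (per-row probabilities range from ~0 up to ~2p).
delete_mar_rank <- function(df, col_mis, col_ctrl, p) {
  r <- rank(df[[col_ctrl]])
  prob <- p * r / mean(r)
  df[[col_mis]][runif(nrow(df)) < prob] <- NA
  df
}

set.seed(1)
df <- data.frame(x1 = rnorm(500), x2 = rnorm(500))
d <- delete_mar_rank(df, "x1", "x2", p = 0.2)
# Missing x1 values should concentrate where the observed x2 is large
```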

Missing Not at Random (MNAR): Missingness depends on both the observed and unobserved data, \(P(R|Y_o, Y_m, q)\), making it difficult to model and handle. Identifying and managing MNAR data requires strong assumptions about the relationship between missing and observed values.

[Figure: MNAR, where missingness in the variable is linked to both the observed data (x1-x4) and unobserved data (x5, x6).]

This study focuses on Missing at Random (MAR) data due to its practical relevance and analytic tractability compared to Missing Completely at Random (MCAR) and Missing Not at Random (MNAR).

We exclude Missing Completely at Random (MCAR) data because its randomness preserves the dataset’s underlying distribution, making missingness equivalent to reduced sample size without introducing bias. Handling MNAR (Missing Not at Random) data is beyond this study’s scope because identifying and modeling its dependence on unobserved variables is often ambiguous and requires unverifiable assumptions.

1.2.1 Synthetic MAR mechanism

Let’s create a simple dataset to demonstrate the different mechanisms of missing values and explore various ways to visualize them.

synthetic_data <- synthetic_dataset_gen(n_samples = 500, 
                                        n_covariates = 5, 
                                        correlation = "none", 
                                        target_type = "linear", 
                                        noise_level = 0.5)

1.2.1.1 Censoring method

The delete_MAR_censoring function (from the missMethods package) introduces missing values based on a MAR mechanism: missingness in the target columns (cols_mis) depends on the values of a control column (cols_ctrl). Two censoring mechanisms are available:

  • Sorting based: Rows are sorted by the control column, and missing values are added to the smallest, largest, or both extremes.
  • Quantile based: Missing values are based on calculated quantiles of the control column, without sorting.

[Figure: censoring mechanism, rows are sorted by the control column and target values are censored below a cutoff.]
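The quantile-based variant amounts to a few lines of base R (an illustrative sketch with our own function name, not the missMethods implementation):

```r
# Censor col_mis wherever col_ctrl falls at or below its p-quantile
censor_below_quantile <- function(df, col_mis, col_ctrl, p) {
  cutoff <- quantile(df[[col_ctrl]], probs = p)
  df[[col_mis]][df[[col_ctrl]] <= cutoff] <- NA
  df
}

d <- censor_below_quantile(data.frame(x1 = 101:110, x2 = 1:10), "x1", "x2", p = 0.3)
sum(is.na(d$x1))  # 3: the rows with the three smallest x2 values
```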

data.MAR.uni <- delete_MAR_censoring(synthetic_data, p = 0.1, cols_mis = c('X1'), cols_ctrl = c('X2'))

# Create a new dataset indicating missing status
intersection_dataset <- synthetic_data
intersection_dataset$missing <- ifelse(is.na(data.MAR.uni$X1), "Missing", "Present")

plot_missing_data(data.MAR.uni, "MAR uni", c("white", red_nord))

# Calculate global ranges for X1 and all Y variables (X2-X5)
x_range <- range(synthetic_data$X1, na.rm = TRUE)
y_range <- range(synthetic_data[, paste0("X", 2:5)], na.rm = TRUE)

# Add padding to ranges
x_pad <- diff(x_range) * 0.05
y_pad <- diff(y_range) * 0.05

# Build one marginal scatter plot of X1 against each of X2-X5,
# colouring points by the missing status of X1
marginal_plots <- lapply(paste0("X", 2:5), function(y_var) {
  plot_data <- data.frame(
    X1 = synthetic_data$X1,
    Y = synthetic_data[[y_var]],
    missing = ifelse(is.na(data.MAR.uni$X1), "Missing", "Present")
  )

  scatter_plot <- ggplot(plot_data, aes(x = X1, y = Y, color = missing)) +
    geom_point() +
    scale_color_manual(values = c("Missing" = red_nord, "Present" = "black")) +
    scale_x_continuous(limits = c(x_range[1] - x_pad, x_range[2] + x_pad)) +
    scale_y_continuous(limits = c(y_range[1] - y_pad, y_range[2] + y_pad)) +
    theme_minimal() +
    labs(color = "Data Status", x = "X1", y = y_var) +
    theme(legend.position = "none") +
    coord_fixed()

  ggMarginal(scatter_plot, margins = "y",
             groupColour = TRUE, groupFill = TRUE,
             type = "boxplot", size = 10, width = 0.15, outlier.size = 1)
})

# Arrange the plots in a 2x2 grid
grid.arrange(grobs = marginal_plots, ncol = 2,
             top = textGrob("MAR censoring method visualisation",
                            gp = gpar(fontsize = 16, font = 2)))

1.2.1.2 Likelihood method

The delete_MAR_1_to_x() function (from the missMethods package) creates MAR values by controlling missingness in one column (cols_mis) based on the values of another column (cols_ctrl). It splits the data into two groups using a cutoff value (e.g., the median) of cols_ctrl: group 1 contains the rows below the cutoff, group 2 the rows above it. The x parameter sets the odds ratio of missing data: for x = 3, group 2 is three times more likely to have missing values than group 1.

[Figure: the data are split into two groups at the median of the control column; a value in group 2 is x times as likely to be deleted as one in group 1.]
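The 1-to-x idea can be sketched as follows (illustrative code with our own function name; the package derives the per-group probabilities from the odds ratio more carefully):

```r
# Split on the median of col_ctrl; rows above it are x times as likely
# to lose col_mis, with the overall missing proportion targeting p.
delete_1_to_x_sketch <- function(df, col_mis, col_ctrl, p, x) {
  high <- df[[col_ctrl]] > median(df[[col_ctrl]])
  p1 <- 2 * p / (1 + x)         # group below the cutoff
  p2 <- x * p1                  # group above: x times more likely
  prob <- ifelse(high, p2, p1)  # averages to p when groups are equal-sized
  df[[col_mis]][runif(nrow(df)) < prob] <- NA
  df
}

set.seed(7)
df <- data.frame(x1 = rnorm(1000), x3 = rnorm(1000))
d <- delete_1_to_x_sketch(df, "x1", "x3", p = 0.2, x = 50)
# Almost all deletions should land in the above-median group of x3
```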

data.MAR.multi <- delete_MAR_1_to_x(synthetic_data, 
                                    p = 0.2, 
                                    cols_mis = c('X1'), 
                                    cols_ctrl = c('X3'), 
                                    x = 50)

data.MAR.multi <- delete_MAR_1_to_x(data.MAR.multi, 
                                    p = 0.2, 
                                    cols_mis = c('X1'), 
                                    cols_ctrl = c('X4'), 
                                    x = 50)

data.MAR.multi <- delete_MAR_1_to_x(data.MAR.multi, 
                                    p = 0.2, 
                                    cols_mis = c('X2'), 
                                    cols_ctrl = c('X5'), 
                                    x = 50)


plot_missing_data(data.MAR.multi, "MAR multi", c("white", red_nord))

# Calculate global ranges for X1 and all Y variables (X2-X5)
x_range <- range(synthetic_data$X1, na.rm = TRUE)
y_range <- range(synthetic_data[, paste0("X", 2:5)], na.rm = TRUE)

# Add padding to ranges
x_pad <- diff(x_range) * 0.05
y_pad <- diff(y_range) * 0.05

# Build one marginal scatter plot of X1 against each of X2-X5,
# colouring points by the missing status of X1
marginal_plots <- lapply(paste0("X", 2:5), function(y_var) {
  plot_data <- data.frame(
    X1 = synthetic_data$X1,
    Y = synthetic_data[[y_var]],
    missing = ifelse(is.na(data.MAR.multi$X1), "Missing", "Present")
  )

  scatter_plot <- ggplot(plot_data, aes(x = X1, y = Y, color = missing)) +
    geom_point() +
    scale_color_manual(values = c("Missing" = red_nord, "Present" = "black")) +
    scale_x_continuous(limits = c(x_range[1] - x_pad, x_range[2] + x_pad)) +
    scale_y_continuous(limits = c(y_range[1] - y_pad, y_range[2] + y_pad)) +
    theme_minimal() +
    labs(color = "Data Status", x = "X1", y = y_var) +
    theme(legend.position = "none") +
    coord_fixed()

  ggMarginal(scatter_plot, margins = "y",
             groupColour = TRUE, groupFill = TRUE,
             type = "boxplot", size = 10, width = 0.15, outlier.size = 1)
})

# Arrange the plots in a 2x2 grid
grid.arrange(grobs = marginal_plots, ncol = 2,
             top = textGrob("MAR likelihood method visualisation",
                            gp = gpar(fontsize = 16, font = 2)))

2 Imputation techniques

This section reviews imputation techniques for handling missing data, from simple deletion methods to advanced machine learning approaches, highlighting their uses and impact on data reliability.

  • List-wise Deletion: removes cases with missing values. Strengths: simple; works for MCAR data. Limitations: causes data loss and bias if MCAR is violated.
  • Pairwise Deletion: removes data points only when necessary. Strengths: retains more data; less bias for MCAR/MAR. Limitations: inconsistent sample sizes; potential bias.
  • Simple Imputation: replaces missing values with the mean, median, or mode. Strengths: easy and efficient. Limitations: can distort data and introduce bias.
  • Regression Imputation: predicts missing values using regression. Strengths: preserves sample size; good for MAR. Limitations: assumes linearity; less effective for non-linear data.
  • Hot-deck Imputation: fills missing values from similar cases. Strengths: avoids model assumptions. Limitations: less reliable for small datasets.
  • Expectation-Maximization (EM): iteratively estimates missing values. Strengths: accurate imputations; handles MAR well. Limitations: computationally intensive.
  • Multiple Imputation: generates multiple datasets to reflect uncertainty. Strengths: reduces bias; more robust results. Limitations: time-consuming; complex to combine results.
  • GAM Imputation: uses Generalized Additive Models for non-linear data. Strengths: flexible for non-linear data. Limitations: more complex and computationally expensive.
  • Decision Tree / Random Forest: builds models using decision trees. Strengths: captures complex patterns; handles various data types. Limitations: computationally heavy; prone to overfitting.
  • K-Nearest Neighbors (KNN): imputes based on nearest neighbors. Strengths: flexible and captures patterns. Limitations: high computational cost for large datasets.

2.1 Performance metrics between Datasets

Since we possess the original dataset prior to the introduction of missing values, we introduce two metrics to compare the distributions of two datasets: the original dataset without missing values and the dataset in which missing values have been imputed using the various techniques.

Indeed, we care more about comparing the distributions of the two datasets than about a pointwise comparison of the reconstructed values.

  • The Jensen–Shannon Divergence (JSD) quantifies the similarity or divergence between two probability distributions \(P\) and \(Q\). It is based on the midpoint distribution \(M = \frac{1}{2}(P + Q)\) and is calculated as the average of the Kullback-Leibler (KL) divergences of \(P\) and \(Q\) relative to \(M\). JSD is symmetric and always non-negative, with values ranging between \(0\) (identical distributions) and \(\log(2)\) (distributions with disjoint supports).

  • The Wasserstein distance provides a way to measure the cost of transforming one probability distribution into another. It can be viewed as the minimum “effort” required to transport all the mass of one distribution to match the other, where the cost depends on both the amount of mass moved and the distance it travels. Unlike divergence-based measures (e.g. KL or JS divergence), Wasserstein distance naturally accounts for the geometry of the sample space, making it particularly useful when the shapes or locations of the distributions differ.
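Both metrics admit short base-R implementations for empirical samples (a sketch under our own function names; in practice packages such as philentropy and transport provide tested versions):

```r
# Jensen-Shannon divergence between two discrete probability vectors
jsd <- function(p, q) {
  p <- p / sum(p); q <- q / sum(q)
  m <- (p + q) / 2  # midpoint distribution
  kl <- function(a, b) sum(ifelse(a > 0, a * log(a / b), 0))
  (kl(p, m) + kl(q, m)) / 2
}

# 1-D Wasserstein distance via the quantile formulation: the mean
# absolute difference between the two empirical quantile functions
wasserstein1d <- function(x, y, n_grid = 1000) {
  probs <- (seq_len(n_grid) - 0.5) / n_grid
  mean(abs(quantile(x, probs) - quantile(y, probs)))
}

jsd(c(1, 0), c(0, 1))            # log(2): disjoint supports
wasserstein1d(1:100, 1:100 + 5)  # 5: a pure shift costs exactly the shift
```

Note how the Wasserstein distance reflects geometry: shifting a distribution by 5 costs 5, whereas the JSD of two non-overlapping samples saturates at \(\log(2)\) regardless of how far apart they are.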

2.2 Visualization of the different techniques

The following paragraph presents visualizations of various imputation techniques applied to different types of datasets to observe their impact. The reconstructed datasets are then evaluated and compared using two distribution-based metrics.

# Define methods with their display names and corresponding functions
imputation_methods <- list(
  "Original Dataset" = function(data) { data }, # baseline: returns the data unchanged
  "Mean Imputation" = function(data) { simple_imputation(data, "mean") },
  "Hot Deck" = function(data) { hot_deck_imputation(data) },
  "KNN Imputation" = function(data) { kNN(data, imp_var = FALSE) },
  "Regression" = function(data) { regression_imputation(data) },
  "Regression with Noise" = function(data) { regression_imputation(data, noise = TRUE) },
  "GAM" = function(data) { gam_based_imputation(data) },
  "GAM with Noise" = function(data) { gam_based_imputation(data, noise = TRUE) },
  "Random Forest" = function(data) { tree_based_imputation(data) },
  "Random Forest with Noise" = function(data) { tree_based_imputation(data, noise=TRUE) }
)

2.2.1 Linear Relationship

We begin by testing the imputation techniques on data characterized by a linear relationship and heteroscedasticity. Methods such as regression imputation are expected to perform well in this scenario, as they are designed to capture and model linear patterns within the data.

2.2.1.1 Analysis of noise in regression imputation

First, we demonstrate how noise strategies can be incorporated into the imputation methods, a technique that will also apply in later scenarios.

# Generate and create missing data
n <- 200
p <- 0.3
linear_data <- generate_data(n, "linear",x_range = c(0, 1),  noise_sd = 0.5, 
                             homoscedasticity = F, min_sd_noise = 0.1)

linear_missing <- delete_MAR_1_to_x(linear_data, p, cols_mis = "x2", cols_ctrl = "x1", x = 10)

noise_imputation_methods <- list(
  "Regression" = function(data) { regression_imputation(data) },
  "Regression with Noise" = function(data) { regression_imputation(data, noise = TRUE) }
)

# Use the new combined analysis function
noise_result <- plot_imputations_and_metrics(original_data = linear_data, missing_data = linear_missing, imputation_methods = noise_imputation_methods)

# Access the list of individual plots
noise_plots <- noise_result$imputation_plots

# Arrange and display them as needed
grid.arrange(grobs = noise_plots, ncol = 2)

2.2.1.2 Comparison of the different techniques for the simple dataset

# Use the new combined analysis function
linear_result <- plot_imputations_and_metrics(original_data = linear_data, missing_data = linear_missing, imputation_methods = imputation_methods)

# Access the list of individual plots
imputation_plots <- linear_result$imputation_plots

# Arrange and display them as needed
grid.arrange(grobs = imputation_plots, ncol = 2)

# Access the metrics plot
metrics_plot <- linear_result$metrics_plot
print(metrics_plot)

2.2.2 Quadratic Relationship

The second dataset is constructed with a quadratic relationship and an unbalanced number of data points. In cases involving quadratic relationships, we anticipate that more flexible imputation methods, such as Generalized Additive Models (GAM) and Random Forest, will outperform simple linear regression due to their ability to capture complex, non-linear patterns in the data. The relationship is generated via \(x_2 = x_1^2 + \epsilon\).

# Generate and create missing data
quad_data <- generate_data(n, "quadratic",x_range = c(-2, 2), noise_sd = 1, balanced = F, alpha = 1, beta = 2)
quad_missing <- delete_MAR_1_to_x(quad_data, p, cols_mis = "x2", cols_ctrl = "x1", x = 10)

quad_result <- plot_imputations_and_metrics(quad_data, quad_missing, imputation_methods)

# Access the list of individual plots
imputation_plots <- quad_result$imputation_plots

# Arrange and display them as needed
grid.arrange(grobs = imputation_plots, ncol = 2)

# Access the metrics plot
metrics_plot <- quad_result$metrics_plot
print(metrics_plot)

2.2.3 Piecewise

The third dataset is designed with a piecewise relationship, where distinct segments of the data follow different patterns. For this type of distribution, we expect imputation methods capable of handling discontinuities and local variations, such as Random Forest and k-Nearest Neighbors, to perform better than global modeling approaches like linear regression, which may struggle to accurately capture the segmented structure of the data. To avoid missingness in only one of the pieces, we lowered the odds to 3 instead of the usual 10.

The dependent variable \(x_2\) is generated using the following piecewise relationship:

\[ x_2 = \begin{cases} 2 + 2x_1 + \epsilon, & \text{if } x_1 < \text{pivot} \\ 15 + 5x_1 + \epsilon, & \text{if } x_1 \geq \text{pivot} \end{cases} \]

Here, \(\text{pivot} = \frac{x_{\text{range}[1]} + x_{\text{range}[2]}}{2}\), and \(\epsilon\) is a random noise term added to introduce variability.

# Generate and create missing data
piecewise_data <- generate_data(n, "piecewise", noise_sd = 2)
piecewise_missing <- delete_MAR_1_to_x(piecewise_data, p, cols_mis = "x2", cols_ctrl = "x1", x = 3)

piecewise_result <- plot_imputations_and_metrics(piecewise_data, piecewise_missing, imputation_methods)

# Access the list of individual plots
imputation_plots <- piecewise_result$imputation_plots

# Arrange and display them as needed
grid.arrange(grobs = imputation_plots, ncol = 2)

# Access the metrics plot
metrics_plot <- piecewise_result$metrics_plot
print(metrics_plot)

2.2.4 Logarithmic Relationship

The final dataset follows a logarithmic relationship, providing insight into how different imputation methods handle non-linear but monotonic patterns. The dataset is generated by \(x_2 = 3 \log(x_1) + \epsilon\).

# Generate and create missing data
log_data <- generate_data(n, "log", x_range = c(1, 100), noise_sd = 1)
log_missing <- delete_MAR_1_to_x(log_data, p, cols_mis = "x2", cols_ctrl = "x1", x = 10)

log_result <- plot_imputations_and_metrics(log_data, log_missing, imputation_methods)

# Access the list of individual plots
imputation_plots <- log_result$imputation_plots

# Arrange and display them as needed
grid.arrange(grobs = imputation_plots, ncol = 2)

# Access the metrics plot
metrics_plot <- log_result$metrics_plot
print(metrics_plot)

3 Second part

3.1 Introduction

In this second part of our project, we address the challenge of missing data in real-world datasets. Unlike the previous section, our approach here begins with a complete dataset, specifically a refined version of the well-known Auto MPG dataset. We will systematically introduce missing values into one variable, conditioned on two other variables, to simulate a Missing at Random mechanism. This will be done using functions from the VIM package in R.

Following the introduction of missing data, we will apply the various imputation techniques, previously discussed, to generate complete datasets. However, instead of simply comparing the distributions of these imputed datasets, our focus will shift to evaluating their performance in the context of linear regression models. Specifically, we will assess the impact of different imputation methods by comparing the estimated coefficients and Mean Squared Errors of the fitted models.
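The planned comparison can be sketched as a small helper that fits one model per dataset and collects its coefficients and MSE (illustrative names; the formula and imputed datasets come from the analysis itself):

```r
# Fit the same linear model on each (imputed) dataset and collect the
# coefficients together with the in-sample mean squared error
compare_fits <- function(datasets, formula) {
  t(sapply(datasets, function(d) {
    fit <- lm(formula, data = d)
    c(coef(fit), MSE = mean(residuals(fit)^2))
  }))
}

# Identical inputs must give identical rows (toy data for illustration)
d <- data.frame(mpg = c(21, 24, 19, 30, 27), weight = c(3.2, 2.8, 3.5, 2.1, 2.4))
compare_fits(list(original = d, imputed = d), mpg ~ weight)
```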

In the final segment of this section, we will extend our analysis by introducing missing values across multiple covariates, conditioned on the remaining variables, resulting in a dataset with a substantial proportion of missing data. To address this, we will employ multiple imputation techniques, particularly those implemented in the MICE package, which utilizes chained equations for imputing multiple variables. Subsequently, we will compare the performance of linear models fitted on these newly imputed datasets.

We have selected the Auto MPG dataset for this analysis due to its simplicity and recognition in the field. Our objective is to fit regression models starting from this dataset and systematically evaluate the effects of various imputation strategies on model performance.

Let’s start by analyzing the data, exploring its features, and examining their relationships.

3.2 Dataset exploration

The “auto-mpg” dataset records technical specifications of car models manufactured in the 1970s and 1980s. The response variable will be mpg (miles per gallon).

First, we examine the variables and observations in the dataset:

## 'data.frame':    398 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : chr  "130.0" "165.0" "150.0" "150.0" ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ model_year  : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ car_name    : chr  "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
##       mpg          cylinders      displacement    horsepower       
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Length:398        
##  1st Qu.:17.50   1st Qu.:4.000   1st Qu.:104.2   Class :character  
##  Median :23.00   Median :4.000   Median :148.5   Mode  :character  
##  Mean   :23.51   Mean   :5.455   Mean   :193.4                     
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:262.0                     
##  Max.   :46.60   Max.   :8.000   Max.   :455.0                     
##      weight      acceleration     model_year        origin     
##  Min.   :1613   Min.   : 8.00   Min.   :70.00   Min.   :1.000  
##  1st Qu.:2224   1st Qu.:13.82   1st Qu.:73.00   1st Qu.:1.000  
##  Median :2804   Median :15.50   Median :76.00   Median :1.000  
##  Mean   :2970   Mean   :15.57   Mean   :76.01   Mean   :1.573  
##  3rd Qu.:3608   3rd Qu.:17.18   3rd Qu.:79.00   3rd Qu.:2.000  
##  Max.   :5140   Max.   :24.80   Max.   :82.00   Max.   :3.000  
##    car_name        
##  Length:398        
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

There are 9 variables with a total of 398 observations.

  • mpg: A continuous variable indicating the miles traveled per gallon of fuel.

  • cylinders: A discrete quantitative variable indicating the number of engine cylinders (to be treated as a factor).

  • displacement: A continuous variable indicating engine displacement.

  • horsepower: A character variable indicating the horsepower of the car (to be treated as a continuous quantitative variable).

  • weight: A continuous variable indicating the car’s weight.

  • acceleration: A continuous variable indicating the car’s acceleration.

  • model_year: An integer variable indicating the year the car model was produced.

  • origin: A categorical variable indicating the continent of the car manufacturer (1: America, 2: Europe, 3: Asia).

  • car_name: A qualitative variable indicating the manufacturer and model name of the car.

3.2.1 Data Pre-Processing

Before building our model, we need to make the data usable. We observe anomalies in the dataset variables, so we clean the data when possible. We also assign the correct data types to variables and create two new variables: cut_model_year and car_brand.

  • cut_model_year: A categorical variable dividing production years into the classes 70-73, 74-76, 77-79, and 80-82.
  • car_brand: A categorical variable that tracks only the car’s manufacturer.
## 'data.frame':    391 obs. of  9 variables:
##  $ mpg           : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders     : Factor w/ 5 levels "3","4","5","6",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ displacement  : num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower    : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight        : num  3504 3693 3436 3433 3449 ...
##  $ acceleration  : num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ origin        : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ cut_model_year: Factor w/ 4 levels "70-73","74-76",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ car_brand     : Factor w/ 28 levels "amc","audi","bmw",..: 6 4 20 1 11 11 6 20 21 1 ...
##  - attr(*, "na.action")= 'omit' Named int 29
##   ..- attr(*, "names")= chr "29"
##       mpg        cylinders  displacement     horsepower        weight    
##  Min.   :10.00   3:  4     Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.25   4:199     1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2224  
##  Median :23.00   5:  3     Median :151.0   Median : 93.0   Median :2800  
##  Mean   :23.48   6: 83     Mean   :194.1   Mean   :104.2   Mean   :2973  
##  3rd Qu.:29.00   8:102     3rd Qu.:264.5   3rd Qu.:125.0   3rd Qu.:3611  
##  Max.   :46.60             Max.   :455.0   Max.   :230.0   Max.   :5140  
##                                                                          
##   acceleration   origin  cut_model_year     car_brand  
##  Min.   : 8.00   1:244   70-73:123      ford     : 49  
##  1st Qu.:13.75   2: 68   74-76: 90      chevrolet: 47  
##  Median :15.50   3: 79   77-79: 93      plymouth : 31  
##  Mean   :15.53           80-82: 85      dodge    : 28  
##  3rd Qu.:17.00                          amc      : 27  
##  Max.   :24.80                          toyota   : 26  
##                                         (Other)  :183

After these steps, we obtain a new dataset that contains 9 variables and 391 observations:

  • mpg: Continuous, indicating fuel efficiency in miles per gallon.
  • cylinders: Discrete, indicating the number of cylinders, categorized into 5 levels: 3, 4, 5, 6, 8.
  • displacement: Continuous, indicating engine displacement.
  • horsepower: Continuous, indicating car horsepower.
  • weight: Continuous, indicating car weight.
  • acceleration: Continuous, providing a parameter for the car’s acceleration.
  • origin: Categorical, indicating the continent of origin (1: America, 2: Europe, 3: Asia).
  • cut_model_year: Categorical, dividing production years into the classes 70-73, 74-76, 77-79, and 80-82.
  • car_brand: Categorical, tracking the car’s manufacturer.

Some summary statistics differ from the original dataset because rows containing missing values were removed.

3.2.2 Exploratory Data Analysis

3.2.2.1 Continuous Variables

Let’s start by visualizing the correlation matrix of the continuous variables:

Our analysis reveals both significant positive and negative correlations among the variables. Notably, acceleration exhibits the weakest correlations with other variables, with coefficients ranging from -0.69 to 0.42. In contrast, variables such as weight, horsepower, and displacement demonstrate strong positive intercorrelations. This observation aligns with expectations, as higher displacement typically results in increased horsepower, and heavier vehicles necessitate more powerful engines. Consequently, these more powerful engines tend to consume more fuel, which is consistent with anticipated outcomes.

3.2.2.1.1 Scatterplots with mpg as the Response Variable

Scatterplots show clear patterns for the first three variables. The positive correlation between acceleration and mpg is less obvious because it has a lower absolute value.

3.2.2.1.2 Distribution of Continuous Variables

The histogram of displacement shows a positively skewed distribution with a peak around 100. The distributions of weight and horsepower are also positively skewed, with heavier right tails. The acceleration variable shows an approximately symmetrical distribution, with mean, mode, and median all around 15.

3.2.2.2 Distribution of mpg

The histogram shows that mpg has a positively skewed distribution with a long right tail. This is more evident when a kernel density curve is added to the histogram, suggesting that transformations may be necessary.
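The effect of a log transform on a positively skewed variable can be checked numerically; a minimal sketch on simulated mpg-like data (the skewness() helper below is defined here for illustration, not taken from setup.R):

```r
# Sample skewness: third standardized moment
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

set.seed(1)
x <- rlnorm(500, meanlog = 3, sdlog = 0.4)  # positively skewed, mpg-like scale
c(raw = skewness(x), logged = skewness(log(x)))
```

The raw series has clearly positive skewness, while the logged series is close to symmetric, which is what motivates modelling log(mpg) later in the analysis.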

3.2.2.3 Categorical Variables

We analyze categorical variables in relation to mpg using boxplots. Because car_brand has a high number of categories, we use a bar plot for it instead.

The number of cylinders slightly influences mpg. We notice that 4-cylinder engines appear to be the most efficient (in fact, mid-to-high range compact cars generally use this type of engine, where low fuel consumption is a key requirement). Conversely, 8-cylinder engines are the least efficient in terms of fuel consumption (as 8-cylinder engines, typically V8s, are characteristic of supercars or American muscle cars).

The country where the car manufacturer is based also seems to impact fuel consumption. American manufacturers tend to produce cars with higher fuel consumption, while Asian manufacturers perform slightly better than European ones.

The year of production shows a positive trend regarding fuel efficiency as the years progress. This could be attributed to technological advancements enabling the production of more fuel-efficient cars or to the oil crises of the 1970s, which may have pushed manufacturers in this direction.

Finally, regarding the car manufacturer, it is difficult to identify significant patterns. The only notable observation is that some brands, such as Nissan, have a much higher average miles-per-gallon than brands such as Chrysler (the former being an Asian brand, the latter American).

3.2.3 Histograms and Density Curves

3.2.3.1 Histogram of mpg by Origin

The graphs show unimodal curves with differing frequency peaks and averages based on the continent of origin:

  • American Cars: Average 20 mpg, mode 13 mpg.
  • European Cars: Average 27 mpg, mode 24 mpg.
  • Asian Cars: Average 30 mpg, mode 32 mpg.

This confirms that American cars generally consume more fuel, followed by European cars, with Asian cars being the most fuel-efficient.
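Group summaries like the averages above can be computed with tapply(); a self-contained illustration on the built-in mtcars data, with cyl standing in for a grouping factor such as origin:

```r
# Mean mpg per group; with the Auto MPG data the call would be
# tapply(auto_mpg$mpg, auto_mpg$origin, mean)
group_means <- tapply(mtcars$mpg, mtcars$cyl, mean)
round(group_means, 1)
```

As in the Auto MPG data, efficiency decreases monotonically as the number of cylinders grows.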

3.2.3.1.1 Histogram of mpg by Number of Cylinders

The distributions are similar across categories, with high peaks followed by heavier right tails. This suggests that the number of cylinders is not a major factor in explaining mpg. However, cars with more cylinders tend to achieve fewer miles per gallon. Data on 5-cylinder cars is limited, making conclusions difficult.

3.2.3.1.2 Histogram of mpg by Production Year

Production year initially appears to have little influence, but there is a marked change in distribution patterns for cars from the 1980s, particularly compared to those from the early 1970s.

3.2.4 Development and Evaluation of a Linear Regression Model on the Complete Auto MPG Dataset

In this subsection, we consider the construction and assessment of a linear regression model utilizing the complete Auto MPG dataset. Initially, we partition the dataset into training and testing subsets, employing an 80-20 split to ensure robust model validation. Subsequently, we fit multiple linear regression models to identify the optimal combination of predictor variables. Our primary model focuses on predicting the natural logarithm of mpg based on variables such as weight, horsepower, and categorized model year. We then evaluate the performance of this model, comparing it against alternative models with different predictor combinations, to determine the most effective predictors for fuel efficiency.

fit_original <- lm(I(log(mpg)) ~ weight + horsepower + cut_model_year, data = auto_mpg)

From the assessment of the candidate models, the one using weight, horsepower, and cut_model_year as predictors fits the data best.
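The comparison of candidate models is not shown here; one common way to compare nested candidates is via R², AIC, and an F-test, sketched below on the built-in mtcars data as a stand-in (wt and hp playing the roles of weight and horsepower):

```r
# Nested-model comparison sketch (mtcars stand-in)
fit_full    <- lm(log(mpg) ~ wt + hp, data = mtcars)
fit_reduced <- lm(log(mpg) ~ wt, data = mtcars)

AIC(fit_full, fit_reduced)    # lower AIC = better fit/complexity trade-off
anova(fit_reduced, fit_full)  # F-test for the added predictor
```

The same pattern (full R² never below the reduced model's, AIC and the F-test arbitrating whether the extra predictor earns its keep) applies to the Auto MPG candidate models.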

3.2.5 Generation and visualization of missing values using MAR

As anticipated, we generate missing values in the dataset with a MAR mechanism. We chose to delete values of the predictor weight conditional on acceleration and displacement, given weight's high correlation with the other variables and its important role in predicting the target.

# Control values above the 75th percentile drive missingness
cutoff_75th_percentile <- function(x) quantile(x, 0.75)

# MAR 1-to-x: rows with displacement above the cutoff are roughly
# x = 10 times as likely to lose their weight value
missingMar <- delete_MAR_1_to_x(data_original, p = 0.2, cols_mis = "weight",
                                cols_ctrl = "displacement", x = 10,
                                cutoff_fun = cutoff_75th_percentile)
# MAR censoring: weight is deleted in the rows with the most extreme
# acceleration values
missingMar <- delete_MAR_censoring(missingMar, p = 0.1, cols_mis = "weight",
                                   cols_ctrl = "acceleration")

#plot VIM
aggr_plot <- aggr(missingMar, numbers = TRUE,labels=names(data_original), prop = c(TRUE, FALSE), col = nord_contrast)

#plot 2
missing.rows = dim(missingMar)[1] -  dim(na.omit(missingMar))[1]
sprintf("Missing rows: %s (%s%%)", missing.rows, round((missing.rows*100)/dim(missingMar)[1], 2))
## [1] "Missing rows: 76 (24.2%)"
missings_df <- data.frame(type=c("missing", "non-missing") ,count = c(missing.rows,  dim(na.omit(missingMar))[1]))

ggplot(missings_df, aes(fill=type, y="", x=count)) + 
    geom_bar(position="stack", stat="identity")+
    ggtitle("Missing data frequency") +
    xlab("Obs") + ylab("") +
    theme(text = element_text(size = 18))+
    scale_fill_manual(values = c(nord_colors[2], nord_colors[3]))+
    theme_bw()

3.2.5.1 Plot missing values of weight against displacement and acceleration

# Calculate global ranges for both x and y variables
dis_range <- range(c(data_original$displacement, missingMar$displacement), na.rm = TRUE)
weight_range <- range(c(data_original$weight, missingMar$weight), na.rm = TRUE)
accel_range <- range(c(data_original$acceleration, missingMar$acceleration), na.rm = TRUE)

# Add padding to ranges
dis_pad <- diff(dis_range) * 0.05
weight_pad <- diff(weight_range) * 0.05
accel_pad <- diff(accel_range) * 0.05

# First plot: Displacement vs. Weight
plot_data1 <- data_original[, c("weight", "displacement")]
plot_data1$missing <- ifelse(is.na(missingMar$weight), "Missing", "Present")
scatter_plot1 <- ggplot(plot_data1, aes(x = weight, y = displacement, color = missing)) +
  geom_point(size = 2) +  # Increased point size
  scale_color_manual(values = c("Missing" = red_nord, "Present" = "black")) +  # red for missing, black for present
  scale_y_continuous(limits = c(dis_range[1] - dis_pad, dis_range[2] + dis_pad)) +
  scale_x_continuous(limits = c(weight_range[1] - weight_pad, weight_range[2] + weight_pad)) +
  theme_minimal(base_size = 12) +  # Increased base font size
  labs(color = "Data Status", x = "Weight", y = "Displacement") +
  theme(legend.position = "none")

final_plot1 <- ggMarginal(scatter_plot1, 
                          margins = "y", 
                          groupColour = TRUE, 
                          groupFill = TRUE,
                          type = "boxplot", 
                          size = 5,  
                          width = 0.2,  
                          outlier.size = 1)

# Second plot: Acceleration vs. Weight
plot_data2 <- data_original[, c("weight", "acceleration")]
plot_data2$missing <- ifelse(is.na(missingMar$weight), "Missing", "Present")

scatter_plot2 <- ggplot(plot_data2, aes(x = weight, y = acceleration, color = missing)) +
  geom_point(size = 2) +  
  scale_color_manual(values = c("Missing" = red_nord, "Present" = "black" )) +  
  scale_y_continuous(limits = c(accel_range[1] - accel_pad, accel_range[2] + accel_pad)) +
  scale_x_continuous(limits = c(weight_range[1] - weight_pad, weight_range[2] + weight_pad)) +
  theme_minimal(base_size = 12) +  
  labs(color = "Data Status", x = "Weight", y = "Acceleration") +
  theme(legend.position = "none")

final_plot2 <- ggMarginal(scatter_plot2, 
                          margins = "y", 
                          groupColour = TRUE, 
                          groupFill = TRUE,
                          type = "boxplot", 
                          size = 5,  
                          width = 0.2,  
                          outlier.size = 1)

# Combine the two plots using gridExtra::grid.arrange()
grid.arrange(final_plot1, final_plot2, ncol=1)

3.2.5.2 3D plot to better showcase the relationship

In this section, we employ a 3D scatter plot to elucidate the relationships among ‘displacement’, ‘acceleration’, and ‘weight’ within our dataset. This visualization not only highlights the interplay between these variables but also distinctly marks the data points where ‘weight’ values are missing.

# Assuming `data_original` is your original dataset and `missingMar` is the modified one
plotMar_data <- data_original %>%
  mutate(missing = ifelse(is.na(missingMar$weight), "Missing", "Present"))  # Flag missing/present

# Create the 3D plot based on `plotMar_data`

plot_ly(data = plotMar_data, 
        x = ~displacement, 
        y = ~acceleration, 
        z = ~weight,  # Use original weight for z-axis
        color = ~missing, 
        colors = c("Present" = "black", "Missing" = red_nord),  # red_nord is defined in setup.R
        type = 'scatter3d', 
        mode = 'markers', 
        marker = list(size = 4)) %>%  # Adjust point size here
  plotly::layout(
    title = '3D Plot of Missing vs Present Data',
    scene = list(
      xaxis = list(title = 'Displacement'),
      yaxis = list(title = 'Acceleration'),
      zaxis = list(title = 'Weight')
    )
  )

3.2.6 Imputation Techniques and Linear Model Fitting

In this section, we address the challenge of missing data within our dataset by applying various imputation methods. The goal is to assess how different imputation strategies influence the performance and outcomes of linear regression models.

3.3 Mean, median, listwise deletion, regression, random forest, and GAM imputation

imp_ds_mean <- impute_mean(missingMar)
imp_ds_median <- impute_median(missingMar)
imp_del_list <- listwise_deletion(missingMar)
imp_ds_reg <- regression_imputation(missingMar[, -which(names(missingMar) == "car_brand")], noise = TRUE)
imp_ds_rf <- tree_based_imputation(missingMar[, -which(names(missingMar) == "car_brand")], noise = TRUE)

# sub dataset for Gam model
prova <- missingMar[, !names(missingMar) %in% c("car_brand", "origin", "cylinders", "cut_model_year")]
imp_ds_gam <- gam_based_imputation(prova, noise = TRUE, max_predictors = 4)
imp_ds_gam <- cbind(imp_ds_gam, missingMar[, c("car_brand", "origin", "cylinders", "cut_model_year")])

Post-imputation, we fit a linear regression model to each imputed dataset. The models predict the logarithm of miles per gallon (log(mpg)) using predictors such as weight, horsepower, and cut_model_year. This approach enables us to evaluate and compare the impact of each imputation method on the regression coefficients and overall model performance.

By systematically applying these imputation techniques and fitting corresponding linear models, we aim to identify the most effective strategies for handling missing data in our analysis.

fit_mean<-lm(I(log(mpg))~weight+horsepower + cut_model_year, data = imp_ds_mean)
fit_median<-lm(I(log(mpg))~weight+horsepower + cut_model_year, data = imp_ds_median)
fit_list<-lm(I(log(mpg))~weight+horsepower + cut_model_year, data = imp_del_list)
fit_reg<-lm(I(log(mpg))~weight+horsepower + cut_model_year, data = imp_ds_reg)
fit_rf<-lm(I(log(mpg))~weight+horsepower + cut_model_year, data = imp_ds_rf)
fit_gam<-lm(I(log(mpg))~weight+horsepower + cut_model_year, data = imp_ds_gam)

3.3.1 Summary and Analysis of coefficients

In this section, we present a comprehensive summary and analysis of the coefficients derived from linear models fitted to datasets subjected to various imputation techniques. The imputation methods employed include mean imputation, median imputation, listwise deletion, regression imputation, random forest imputation, and GAM-based imputation. For each imputed dataset, a linear model was fitted with the logarithm of miles per gallon as the response variable and weight, horsepower, and cut_model_year as predictor variables.

We begin by summarizing the fitted models, highlighting key statistics such as coefficient estimates, standard errors, and significance levels. Subsequently, we conduct a detailed coefficient analysis, focusing on the estimates and their confidence intervals for each predictor across the different imputation methods. This analysis is further visualized through plots that compare the coefficients and their confidence intervals, providing insights into the variability and stability of the estimates resulting from the different imputation strategies.

By examining these comparisons, we aim to assess the impact of various imputation methods on the model coefficients and to identify which methods yield the most reliable and consistent estimates.
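The extract_coefficients() helper is defined in src/setup.R and not shown in this excerpt; one plausible sketch of such a helper (function body, column names, and output structure are assumptions):

```r
# Hypothetical sketch of extract_coefficients(): pull estimates and
# 95% confidence intervals for selected terms from a named list of fits
extract_coefficients <- function(fits, coef_names = NULL) {
  do.call(rbind, lapply(names(fits), function(nm) {
    est  <- coef(fits[[nm]])
    ci   <- confint(fits[[nm]])
    keep <- if (is.null(coef_names)) names(est) else intersect(coef_names, names(est))
    data.frame(model    = nm,
               term     = keep,
               estimate = unname(est[keep]),
               lower    = unname(ci[keep, 1]),
               upper    = unname(ci[keep, 2]),
               row.names = NULL)
  }))
}
```

A long data frame of this shape (one row per model/term pair) is what plot_coefficients() would then facet or colour by model.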

3.3.1.1 Summary

summary(fit_mean)
## 
## Call:
## lm(formula = I(log(mpg)) ~ weight + horsepower + cut_model_year, 
##     data = imp_ds_mean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41008 -0.08189  0.00002  0.09071  0.41170 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.927e+00  3.875e-02 101.342  < 2e-16 ***
## weight              -1.555e-04  1.523e-05 -10.215  < 2e-16 ***
## horsepower          -4.967e-03  2.607e-04 -19.055  < 2e-16 ***
## cut_model_year74-76  3.946e-02  2.181e-02   1.809   0.0714 .  
## cut_model_year77-79  1.718e-01  2.104e-02   8.169 8.14e-15 ***
## cut_model_year80-82  3.015e-01  2.345e-02  12.857  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1326 on 308 degrees of freedom
## Multiple R-squared:  0.8444, Adjusted R-squared:  0.8419 
## F-statistic: 334.3 on 5 and 308 DF,  p-value: < 2.2e-16
summary(fit_median)
## 
## Call:
## lm(formula = I(log(mpg)) ~ weight + horsepower + cut_model_year, 
##     data = imp_ds_median)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41919 -0.08380 -0.00065  0.08807  0.41444 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.913e+00  3.931e-02  99.550  < 2e-16 ***
## weight              -1.400e-04  1.471e-05  -9.516  < 2e-16 ***
## horsepower          -5.252e-03  2.539e-04 -20.684  < 2e-16 ***
## cut_model_year74-76  3.469e-02  2.214e-02   1.567    0.118    
## cut_model_year77-79  1.676e-01  2.136e-02   7.849 7.01e-14 ***
## cut_model_year80-82  2.987e-01  2.384e-02  12.530  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1349 on 308 degrees of freedom
## Multiple R-squared:  0.839,  Adjusted R-squared:  0.8364 
## F-statistic: 321.1 on 5 and 308 DF,  p-value: < 2.2e-16
summary(fit_list)
## 
## Call:
## lm(formula = I(log(mpg)) ~ weight + horsepower + cut_model_year, 
##     data = imp_del_list)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.34241 -0.05980 -0.00390  0.06574  0.35892 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.934e+00  3.288e-02 119.643  < 2e-16 ***
## weight              -2.712e-04  1.947e-05 -13.929  < 2e-16 ***
## horsepower          -1.574e-03  4.984e-04  -3.158   0.0018 ** 
## cut_model_year74-76  8.591e-02  2.117e-02   4.057 6.78e-05 ***
## cut_model_year77-79  1.834e-01  2.013e-02   9.113  < 2e-16 ***
## cut_model_year80-82  2.958e-01  2.090e-02  14.151  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1108 on 232 degrees of freedom
## Multiple R-squared:  0.8512, Adjusted R-squared:  0.848 
## F-statistic: 265.5 on 5 and 232 DF,  p-value: < 2.2e-16
summary(fit_reg)
## 
## Call:
## lm(formula = I(log(mpg)) ~ weight + horsepower + cut_model_year, 
##     data = imp_ds_reg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.31674 -0.07332 -0.00046  0.07595  0.34464 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.837e+00  2.965e-02 129.392  < 2e-16 ***
## weight              -2.626e-04  1.871e-05 -14.030  < 2e-16 ***
## horsepower          -7.398e-04  4.496e-04  -1.646 0.100842    
## cut_model_year74-76  7.045e-02  2.006e-02   3.512 0.000511 ***
## cut_model_year77-79  1.813e-01  1.900e-02   9.542  < 2e-16 ***
## cut_model_year80-82  3.107e-01  2.121e-02  14.648  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1198 on 308 degrees of freedom
## Multiple R-squared:  0.8729, Adjusted R-squared:  0.8709 
## F-statistic: 423.1 on 5 and 308 DF,  p-value: < 2.2e-16
summary(fit_rf)
## 
## Call:
## lm(formula = I(log(mpg)) ~ weight + horsepower + cut_model_year, 
##     data = imp_ds_rf)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.31490 -0.06511  0.00235  0.07253  0.35447 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.893e+00  2.886e-02 134.863  < 2e-16 ***
## weight              -2.776e-04  1.702e-05 -16.305  < 2e-16 ***
## horsepower          -8.072e-04  3.902e-04  -2.068 0.039429 *  
## cut_model_year74-76  6.501e-02  1.861e-02   3.494 0.000546 ***
## cut_model_year77-79  1.713e-01  1.766e-02   9.700  < 2e-16 ***
## cut_model_year80-82  2.960e-01  1.978e-02  14.965  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1124 on 308 degrees of freedom
## Multiple R-squared:  0.8882, Adjusted R-squared:  0.8864 
## F-statistic: 489.4 on 5 and 308 DF,  p-value: < 2.2e-16
summary(fit_gam)
## 
## Call:
## lm(formula = I(log(mpg)) ~ weight + horsepower + cut_model_year, 
##     data = imp_ds_gam)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.31702 -0.06743  0.00179  0.07344  0.35273 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.879e+00  2.892e-02 134.116  < 2e-16 ***
## weight              -2.708e-04  1.707e-05 -15.871  < 2e-16 ***
## horsepower          -9.055e-04  3.941e-04  -2.298 0.022255 *  
## cut_model_year74-76  6.705e-02  1.889e-02   3.550 0.000445 ***
## cut_model_year77-79  1.774e-01  1.795e-02   9.886  < 2e-16 ***
## cut_model_year80-82  2.999e-01  2.005e-02  14.960  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1138 on 308 degrees of freedom
## Multiple R-squared:  0.8854, Adjusted R-squared:  0.8836 
## F-statistic:   476 on 5 and 308 DF,  p-value: < 2.2e-16

3.3.1.2 Plots of the residuals

set_ds  <- list(imp_ds_mean, imp_ds_median, imp_del_list, imp_ds_reg, imp_ds_rf, imp_ds_gam)
set_fit <- list(fit_mean,    fit_median,    fit_list,      fit_reg,    fit_rf,    fit_gam)

plots_list <- vector("list", length(set_ds))  # pre-allocate

for (i in seq_along(set_ds)) {
  plots_list[[i]] <- my_diagnostic_plots(set_fit[[i]], set_ds[[i]], blue_nord)
}

Residual analysis mean

plots_list[[1]]

Residual analysis median

plots_list[[2]]

Residual analysis deletion

plots_list[[3]]

Residual analysis regression

plots_list[[4]]

Residual analysis random forest

plots_list[[5]]

Residual analysis gam

plots_list[[6]]

3.3.1.3 Coefficient analysis

# Define models
fit_models <- list(
  mean = fit_mean, median = fit_median, list = fit_list,
  reg = fit_reg, rf = fit_rf, gam = fit_gam, or = fit_original
)

# Full coefficient extraction and data frame creation
coef_df <- extract_coefficients(fit_models)

3.3.2 Intercept

# Selected coefficient example: Intercept
intercept_df <- extract_coefficients(fit_models, coef_names = "(Intercept)")
intercept_df
# Plotting Intercept
plot_coefficients(intercept_df, "Comparison of Intercept Coefficients with Confidence Intervals", y_limits = c(3.65, 4.05))

3.3.3 Weight

# Selected coefficient example: Weight
weight_df <- extract_coefficients(fit_models, coef_names = "weight")
weight_df
# Plotting Weight
plot_coefficients(weight_df, "Comparison of Weight Coefficients with Confidence Intervals")

3.3.4 Horsepower

# Selected coefficient example: Horsepower
horsepower_df <- extract_coefficients(fit_models, coef_names = "horsepower")
horsepower_df
# Plotting Horsepower
plot_coefficients(horsepower_df, "Comparison of Horsepower Coefficients with Confidence Intervals")

3.3.5 Cut_model_year

# Selected coefficients: Multiple terms (e.g., cut_model_year)
cut_model_year_df <- extract_coefficients(fit_models, coef_names = c("cut_model_year74-76", "cut_model_year77-79", "cut_model_year80-82"))

# Plotting Multiple Terms
plot_coefficients(cut_model_year_df, "Comparison of Cut Model Year Coefficients with Confidence Intervals")

3.3.6 Predictions based on RMSE
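The rmse() helper used throughout this section comes from the project utilities and is not shown in this excerpt; it is presumably the usual root mean squared error:

```r
# Hypothetical definition of the rmse() helper used below
rmse <- function(predicted, actual) sqrt(mean((predicted - actual)^2))

rmse(c(1, 2, 3), c(1, 2, 5))  # sqrt(4/3)
```

Note that in the cross-validation loop below the test-set response is log-transformed, so this compares predictions and truth on the log(mpg) scale.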

run_imputation_and_modeling <- function(train_data, test_data, missingMar) {
  
  missingComplex <- missingMar
  
  # ----------------------------------------------------------------
  # 2. Imputation strategies on missingComplex
  # ----------------------------------------------------------------
  imp_ds_mean   <- impute_mean(missingComplex)
  imp_ds_median <- impute_median(missingComplex)
  imp_del_list  <- listwise_deletion(missingComplex)
  
  imp_ds_reg <- regression_imputation(
    missingComplex[, -which(names(missingComplex) == "car_brand")],
    noise = TRUE
  )
  imp_ds_rf  <- tree_based_imputation(
    missingComplex[, -which(names(missingComplex) == "car_brand")],
    noise = TRUE
  )
  
  # For GAM imputation, remove non-numeric or unneeded columns:
  prova       <- missingComplex[, !names(missingComplex) %in% c("car_brand","origin","cylinders","cut_model_year")]
  imp_ds_gam  <- gam_based_imputation(prova, noise = TRUE, max_predictors = 4)
  # Add back the dropped columns, if needed
  imp_ds_gam  <- cbind(imp_ds_gam, missingComplex[, c("car_brand","origin","cylinders","cut_model_year")])
  
  # ----------------------------------------------------------------
  # 3. Fit models on each imputed dataset
  # ----------------------------------------------------------------
  fit_mean   <- lm(I(log(mpg)) ~ weight + horsepower + cut_model_year, data = imp_ds_mean)
  fit_median <- lm(I(log(mpg)) ~ weight + horsepower + cut_model_year, data = imp_ds_median)
  fit_list   <- lm(I(log(mpg)) ~ weight + horsepower + cut_model_year, data = imp_del_list)
  fit_reg    <- lm(I(log(mpg)) ~ weight + horsepower + cut_model_year, data = imp_ds_reg)
  fit_rf     <- lm(I(log(mpg)) ~ weight + horsepower + cut_model_year, data = imp_ds_rf)
  fit_gam    <- lm(I(log(mpg)) ~ weight + horsepower + cut_model_year, data = imp_ds_gam)

  # (Optional) Fit an "original" model on the *training* data itself 
  fit_original <- lm(I(log(mpg)) ~ weight + horsepower + cut_model_year, data = train_data)

  # ----------------------------------------------------------------
  # 4. Organize models for easy iteration
  # ----------------------------------------------------------------
  models <- list(
    "Listwise deletion"               = fit_list,
    "Mean Imputation"                 = fit_mean,
    "Median Imputation"               = fit_median,
    "Linear regression (with noise)"  = fit_reg,
    "Random Forest"                   = fit_rf,
    "GAM Imputation"                  = fit_gam,
    "Original Dataset (Train Fit)"    = fit_original
  )
  
  # ----------------------------------------------------------------
  # 5. Predict & calculate RMSE on the *test set*
  # ----------------------------------------------------------------
  calculate_rmse <- function(model, new_data, true_values) {
    preds <- predict(model, newdata = new_data)
    rmse(preds, true_values)
  }
  
  results <- lapply(names(models), function(method) {
    data.frame(
      Method = method,
      RMSE   = calculate_rmse(models[[method]], test_data, test_data$mpg)
    )
  })
  
  # Return a single data frame with columns: Method, RMSE
  do.call(rbind, results)
}
data_original<-auto_mpg
K <- 5
folds <- createFolds(1:nrow(data_original), k = K, list = TRUE)

methods <- c("Listwise deletion", 
             "Mean Imputation",
             "Median Imputation",
             "Linear regression (with noise)",
             "Random Forest",
             "GAM Imputation",
             "Original Dataset (Train Fit)")

# store RMSE for each fold in a matrix
rmse_results <- matrix(NA, nrow = K, ncol = length(methods),
                       dimnames = list(paste0("Fold", 1:K), methods))

# We can also store the entire data-frame style results if you prefer
all_results <- vector("list", K)

for (i in seq_along(folds)) {
  test_idx  <- folds[[i]]
  train_idx <- setdiff(seq_len(nrow(data_original)), test_idx)
  
  train_data <- data_original[train_idx, ]
  test_data  <- data_original[test_idx, ]
  test_data$mpg<-log(test_data$mpg)

  # 2a. Induce missingness on the training data
  missingMar <- delete_MAR_1_to_x(train_data, p = 0.2, cols_mis = "weight",
                                  cols_ctrl = "displacement", x = 100)
  missingMar <- delete_MAR_censoring(missingMar, p = 0.1, cols_mis = "weight",
                                     cols_ctrl = "acceleration")

  fold_results <- run_imputation_and_modeling(train_data, test_data, missingMar)
  
  # fold_results is a data frame with columns: Method, RMSE
  all_results[[i]] <- data.frame(Fold = i, fold_results)
  
  # Also fill into the rmse_results matrix
  for (m in methods) {
    rmse_results[i, m] <- fold_results$RMSE[fold_results$Method == m]
  }
}


# Create a single data frame of all folds 
combined_results <- do.call(rbind, all_results)

# avg per method
aggregate(RMSE ~ Method, data = combined_results, FUN = mean)

The analysis revealed that advanced imputation methods, particularly random forest and GAM, yielded superior model performance, as evidenced by more accurate coefficient estimates and narrower confidence intervals. These methods were followed by regression imputation, while simpler approaches like mean imputation and listwise deletion demonstrated comparatively lower performance. This outcome aligns with expectations, as sophisticated imputation techniques are better equipped to capture complex data patterns, leading to more reliable and robust predictive models.

3.4 Imputation with multiple missing variables

missingMar<-delete_MAR_1_to_x(data_original, p = c(0.3,0.3), cols_mis = c("weight", "horsepower"), cols_ctrl =c("displacement", "acceleration"), x =10)

idx_miss <- which(!complete.cases(missingMar))
df_no_miss <- missingMar[complete.cases(missingMar), ]

tmp<-delete_MAR_1_to_x(df_no_miss, p = c(0.3), cols_mis = c("displacement"), cols_ctrl =c("horsepower"), x =100)

missingComplex <- missingMar
missingComplex[complete.cases(missingMar), ] <- tmp

# Convert categorical variables to factors before imputation
# missingComplex$cut_model_year <- as.factor(missingComplex$cut_model_year)

#plot VIM
aggr_plot <- aggr(missingComplex, numbers = TRUE,labels=names(missingComplex), prop = c(TRUE, FALSE),  col = nord_contrast)

missing.rows = dim(missingComplex)[1] -  dim(na.omit(missingComplex))[1]

missings_df <- data.frame(type=c("missing", "non-missing") ,count = c(missing.rows,  dim(na.omit(missingComplex))[1]))

ggplot(missings_df, aes(fill=type, y="", x=count)) + 
    geom_bar(position="stack", stat="identity")+
    ggtitle("Missing data frequency") +
    xlab("Obs") + ylab("") +
    theme(text = element_text(size = 18))+
    scale_fill_manual(values = c(nord_colors[2], nord_colors[3]))+
    theme_bw()

sprintf("Missing rows: %s (%s%%)", missing.rows, round((missing.rows*100)/dim(missingComplex)[1], 2))
## [1] "Missing rows: 263 (67.26%)"
missingComplex_no_year <- subset(missingComplex, select = -cut_model_year)
# Keep only the numeric columns
no_categorical_dataset <- Filter(is.numeric, missingComplex)

3.4.0.1 In this section we test:

  • Listwise deletion
  • Mean imputation
  • Custom Gam imputation
  • Custom Random Forest imputation

complex_ds_mean <- impute_mean(missingComplex)
complex_ds_del  <- listwise_deletion(missingComplex)

# Original model
fit_original_bad <- lm(I(log(mpg))~weight+horsepower,data = auto_mpg)
# Custom models 
fit_complex_mean<-lm(I(log(mpg))~weight+horsepower, data = complex_ds_mean)
fit_complex_del<-lm(I(log(mpg))~weight+horsepower, data = complex_ds_del)
set_ds  <- list(complex_ds_del, complex_ds_mean)
set_fit <- list(fit_complex_del, fit_complex_mean)

plots_list <- vector("list", length(set_ds))  # pre-allocate

for (i in seq_along(set_ds)) {
  plots_list[[i]] <- my_diagnostic_plots(set_fit[[i]], set_ds[[i]], blue_nord)
}

# Now `plots_list` is a list of patchwork objects. You can print them one at a time:
plots_list[[1]]

plots_list[[2]]

# Define models and corresponding prediction variables
models <- list(
  "Listwise deletion" = fit_complex_del,
  "Mean Imputation" = fit_complex_mean,
  "Linear regression (with noise)" = final_predictions_reg,
  "GAM Imputation" = final_predictions_gam,
  "Random Forest" = final_predictions_tree,
  "Original Dataset" = fit_original_bad
)

# Function to predict and calculate RMSE
calculate_rmse <- function(model, new_data, true_values, is_final = FALSE) {
  # Handle models that already contain final predictions
  predictions <- if (is_final) model else predict(model, newdata = new_data)
  rmse(predictions, true_values)
}

# Flag for final predictions
final_methods <- c("Linear regression (with noise)", "GAM Imputation", "Random Forest")

# Calculate RMSE for each model
results <- lapply(names(models), function(method) {
  rmse_value <- calculate_rmse(
    models[[method]],
    test_data,
    test_data$mpg,
    is_final = method %in% final_methods
  )
  data.frame(Method = method, RMSE = rmse_value)
})

# Combine results into a single data frame
results_df <- do.call(rbind, results)

# Optional: View results in a structured format
results_df

4 Conclusions

In this project, we analyzed the problem of missing data, focusing on the Missing at Random (MAR) missingness mechanism. We studied several imputation techniques, as discussed in the literature, and assessed their performance through two main approaches. First, we evaluated how well the distribution of the imputed dataset approximated the original one. Second, we examined the impact of different imputation methods on model fitting, specifically how the estimates of coefficients and the Root Mean Square Error (RMSE) were affected by imputation.

Based on our observations, we can draw the following conclusions:

  • Advanced Imputation Techniques: More sophisticated imputation methods, such as those based on GAMs and Random Forests (RF), proved to be highly accurate. When the percentage of missing data is relatively small, these methods produce reliable results.

  • Listwise Deletion with Simple Models: When using listwise deletion to handle missing values, the results remained relatively robust, particularly when fitting less flexible models such as linear regression. These simple models did not perform poorly even under listwise deletion.

  • Simple Imputation and Regression Imputation: Simple imputation (mean or median) usually produced the worst results, proving to be a weak technique. Regression imputation, especially in this case study, worked well enough; when the data were linearly related it performed as well as GAM-based imputation.

4.0.1 Suggestions for Future Work:

  1. Categorical Variables: Our analysis focused on continuous variables. Future work could extend this analysis to categorical variables, exploring the impact of removing values from different factors or the interaction between categorical variables and missing data in imputation.

  2. Bayesian Approaches: This project utilized classical statistical methods with a frequentist approach. Incorporating Bayesian methods, as demonstrated by the MICE (Multivariate Imputation by Chained Equations) algorithm, could provide an interesting avenue for future work. Many studies suggest that Bayesian approaches may offer optimal solutions for imputing missing values, and integrating these methods into our analysis could enhance the results.

  3. Missing Not at Random (MNAR): We did not address the scenario of missing data that is Missing Not at Random (MNAR), which is often the most challenging case. In such cases, missingness depends on unobserved data, rendering common imputation techniques insufficient. This is an important topic to explore, and we propose future developments in this area to improve the handling of MNAR data.

  4. Noise: We tried different methods of adding noise; this approach appears to be both simple (as supported by the literature) and effective. A drawback, however, is that it may not adequately account for the heteroscedasticity of the dataset.
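As a pointer for the MICE direction mentioned above, a minimal sketch (assuming the mice package is installed, and reusing the missingMar dataset built in Section 3.2.5):

```r
# Sketch of multiple imputation with mice: m = 5 imputed datasets via
# predictive mean matching, a model fit on each, estimates pooled
library(mice)

imp  <- mice(missingMar, m = 5, method = "pmm", printFlag = FALSE)
fits <- with(imp, lm(I(log(mpg)) ~ weight + horsepower + cut_model_year))
summary(pool(fits))  # combine the m sets of estimates via Rubin's rules
```

Unlike the single-imputation methods studied here, the pooled standard errors reflect both within- and between-imputation variability.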

4.1 Appendix

4.1.0.1 Notes on Missing Data Mechanisms

  • MCAR: Missingness is unrelated to data values.
  • MAR: Missingness depends on observed values.
  • MNAR: Missingness depends on unobserved data. This is the hardest case to handle, although including additional observed covariates can sometimes bring it closer to MAR.

4.1.0.2 References